## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom     0.7.1      ✓ recipes   0.1.15
## ✓ dials     0.0.9      ✓ rsample   0.0.8 
## ✓ infer     0.5.3      ✓ tune      0.1.2 
## ✓ modeldata 0.1.0      ✓ workflows 0.2.1 
## ✓ parsnip   0.1.4      ✓ yardstick 0.0.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   team = col_character(),
##   WSWin = col_character()
## )
## See spec(...) for full column specifications.

Initial Plots vs Win Rate

Each statistic below is plotted against win rate with a fitted linear model overlaid, alongside a residual vs. fitted plot. For the time being, only single-variable regression is considered.
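The per-statistic workflow described above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the report's own code: `team_stats`, `mean_stat`, and the simulated coefficients are placeholder assumptions, and base R is used so the block runs without extra packages. The tables in this report look like output from `broom::tidy()` and `broom::glance()`, which summarise the same quantities.

```r
# Simulated stand-in for the team-level data (assumption: the real data
# has one row per team-season with a win rate and a mean statistic).
set.seed(1)
team_stats <- data.frame(mean_stat = rnorm(100, mean = 0.260, sd = 0.012))
team_stats$win_rate <- 0.5 + 2 * (team_stats$mean_stat - 0.260) +
  rnorm(100, sd = 0.06)

# Single-variable linear regression of win rate on one statistic.
fit <- lm(win_rate ~ mean_stat, data = team_stats)

# Coefficient table (estimate, std. error, t statistic, p-value)
# and the adjusted R squared used later to compare models.
coef_table <- summary(fit)$coefficients
adj_r2 <- summary(fit)$adj.r.squared
print(coef_table)
print(adj_r2)

# Residual vs. fitted plot: a patternless cloud around zero is
# consistent with a linear fit being adequate.
plot(fitted(fit), resid(fit),
     xlab = "Fitted win rate", ylab = "Residual")
abline(h = 0, lty = 2)
```

Repeating this with each statistic in place of `mean_stat` would give one coefficient table and one model summary per statistic, as in the sections that follow.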

Batting Average (AVG)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -0.0166    0.0603    -0.275 7.84e- 1
## 2 mean_avg      1.99      0.232      8.57  1.20e-16
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.125         0.124 0.0667      73.5 1.20e-16     1   665. -1324. -1311.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Slugging Percentage (SLG)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -0.0203    0.0427    -0.477 6.34e- 1
## 2 mean_slg      1.25      0.102     12.2   2.57e-30
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.225         0.224 0.0627      149. 2.57e-30     1   696. -1386. -1374.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Isolated Power (ISO)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    0.265    0.0226      11.8 1.99e-28
## 2 mean_iso       1.50     0.143       10.5 2.21e-23
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.176         0.174 0.0647      110. 2.21e-23     1   680. -1354. -1342.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Batting Average on Balls in Play (BABIP)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)    0.103    0.0859      1.20 0.230     
## 2 mean_babip     1.33     0.289       4.62 0.00000488
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1    0.0399        0.0381 0.0699      21.3 4.88e-6     1   641. -1276. -1263.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

On Base Percentage (OBP)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.265    0.0617     -4.29 2.10e- 5
## 2 mean_obp       2.34     0.188      12.4  4.51e-31
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.231         0.229 0.0625      154. 4.51e-31     1   698. -1390. -1377.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Weighted On Base Average (wOBA)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.232    0.0560     -4.13 4.15e- 5
## 2 mean_w_oba     2.26     0.173      13.1  6.66e-34
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.250         0.248 0.0617      171. 6.66e-34     1   704. -1403. -1390.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Weighted Runs Created (wRC)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -0.0404   0.0267       -1.51 1.31e- 1
## 2 mean_w_rc    0.00561  0.000276     20.3  7.25e-68
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.446         0.445 0.0530      414. 7.25e-68     1   783. -1559. -1546.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Base Running (BsR)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)   0.498    0.00309    161.   0          
## 2 mean_bsr      0.0184   0.00367      5.01 0.000000749
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1    0.0467        0.0448 0.0696      25.1 7.49e-7     1   643. -1279. -1267.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Offense (Off)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.469    0.00286      164.  0.      
## 2 mean_off     0.00905  0.000467      19.4 3.03e-63
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.423         0.422 0.0542      376. 3.03e-63     1   772. -1538. -1525.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Defensive Runs Saved (DRS)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.496     0.00336    148.   0.      
## 2 mean_drs     0.00748   0.00114      6.58 1.46e-10
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1    0.0966        0.0944 0.0666      43.3 1.46e-10     1   526. -1046. -1034.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Ultimate Zone Rating (UZR)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)  0.497     0.00338    147.   0         
## 2 mean_uzr     0.00615   0.00125      4.92 0.00000124
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1    0.0536        0.0514 0.0691      24.2 1.24e-6     1   538. -1071. -1059.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Ultimate Zone Rating per 150 Games (UZR/150)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)  0.500    0.00334     150.   0         
## 2 mean_uzr150  0.00418  0.000844      4.95 0.00000107
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1    0.0543        0.0521 0.0691      24.5 1.07e-6     1   539. -1071. -1059.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Defense (Def)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.486     0.00371    131.   0.      
## 2 mean_def     0.00794   0.00105      7.52 3.18e-13
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.117         0.115 0.0668      56.6 3.18e-13     1   553. -1101. -1088.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Wins Above Replacement (WAR)
## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   0.364    0.00600      60.6 6.67e-236
## 2 mean_war      0.0778   0.00320      24.3 2.20e- 87
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.535         0.534 0.0486      591. 2.20e-87     1   828. -1649. -1636.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Analysis of Initial Plots

The residual vs. fitted plots show no discernible patterns, suggesting that a linear model is an adequate fit in all cases. The adjusted R squared values show that the linear models of WAR, wRC, and Off vs. win rate account for the most variance in the data (0.534, 0.445, and 0.422 respectively). The plots do not suggest any cases where a low R squared results from a large number of outliers.
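As a quick check on the comparison, the adjusted R squared values can be collected and sorted. The sketch below copies the `adj.r.squared` values from the model summaries in this report into a named vector; the vector name and the labels are chosen here for illustration.

```r
# Adjusted R squared for each single-variable model, copied from the
# model summaries printed in this report.
adj_r2 <- c(
  AVG = 0.124, SLG = 0.224, ISO = 0.174, BABIP = 0.0381, OBP = 0.229,
  wOBA = 0.248, wRC = 0.445, BsR = 0.0448, Off = 0.422, DRS = 0.0944,
  UZR = 0.0514, `UZR/150` = 0.0521, Def = 0.115, WAR = 0.534
)

# Rank the models from most to least variance explained.
ranked <- sort(adj_r2, decreasing = TRUE)
top3 <- names(ranked)[1:3]
print(ranked)
print(top3)
```

The defensive models (DRS, UZR, UZR/150, Def) report noticeably lower log-likelihoods at a similar residual standard error, which suggests they may have been fit on fewer observations. If so, their AIC and BIC values are not directly comparable with those of the offensive models, whereas adjusted R squared remains a reasonable rough comparison.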

Additional Plots

This section examines how well the WAR, wRC, and Off models function as predictors of win rate.

## # A tibble: 128 x 3
##    .pred win_rate mean_war
##    <dbl>    <dbl>    <dbl>
##  1 0.541    0.488    2.28 
##  2 0.484    0.475    1.54 
##  3 0.564    0.549    2.58 
##  4 0.437    0.333    0.942
##  5 0.630    0.630    3.42 
##  6 0.438    0.447    0.961
##  7 0.437    0.512    0.939
##  8 0.512    0.457    1.91 
##  9 0.500    0.543    1.75 
## 10 0.534    0.580    2.19 
## # … with 118 more rows
## # A tibble: 128 x 4
##    .pred win_rate mean_war pred_diff
##    <dbl>    <dbl>    <dbl>     <dbl>
##  1 0.541    0.488    2.28   0.0531  
##  2 0.484    0.475    1.54   0.00831 
##  3 0.564    0.549    2.58   0.0150  
##  4 0.437    0.333    0.942  0.104   
##  5 0.630    0.630    3.42   0.000221
##  6 0.438    0.447    0.961  0.00880 
##  7 0.437    0.512    0.939  0.0757  
##  8 0.512    0.457    1.91   0.0551  
##  9 0.500    0.543    1.75   0.0435  
## 10 0.534    0.580    2.19   0.0460  
## # … with 118 more rows
## [1] 0.03931678
## [1] 0.03465513
## [1] 0.03475607
## [1] 0.02854189
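The prediction tables above can be partly reproduced from the reported WAR coefficients. The sketch below is an illustration under assumptions: it uses only the ten rows printed above and the rounded coefficients from the WAR model summary (intercept 0.364, slope 0.0778), and it assumes `pred_diff` is the absolute difference between `.pred` and `win_rate`, which matches the printed rows. Because the inputs are rounded, the recomputed values agree with the printed ones only to about three decimal places.

```r
# First ten rows of the prediction table, as printed in this report.
preds <- data.frame(
  win_rate = c(0.488, 0.475, 0.549, 0.333, 0.630,
               0.447, 0.512, 0.457, 0.543, 0.580),
  mean_war = c(2.28, 1.54, 2.58, 0.942, 3.42,
               0.961, 0.939, 1.91, 1.75, 2.19)
)

# Rounded coefficients reported for the WAR model.
intercept <- 0.364
slope <- 0.0778

# Predicted win rate from the fitted line.
preds$.pred <- intercept + slope * preds$mean_war

# Assumed definition of pred_diff: absolute prediction error.
preds$pred_diff <- abs(preds$.pred - preds$win_rate)

# Mean absolute error over these ten rows only (not the full 128 rows).
mean_abs_error <- mean(preds$pred_diff)
print(round(preds, 3))
print(mean_abs_error)
```

Averaging `pred_diff` over all 128 rows, rather than the ten shown here, would give a single error summary per model that could be compared across the WAR, wRC, and Off models.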